30 research outputs found

    Web Archive Services Framework for Tighter Integration Between the Past and Present Web

    Get PDF
    Web archives have contained the cultural history of the web for many years, but they still have limited capabilities for access. Most web archiving research has focused on crawling and preservation activities, with little focus on delivery methods. The current access methods are tightly coupled with web archive infrastructure, hard to replicate or integrate with other web archives, and do not cover all of the users' needs. In this dissertation, we focus on access methods for archived web data that enable users, third-party developers, researchers, and others to gain knowledge from web archives. We build ArcSys, a new service framework that extracts, preserves, and exposes APIs for the web archive corpus. The dissertation introduces a novel categorization technique to divide the archived corpus into four levels. For each level, we propose suitable services and APIs that enable both users and third-party developers to build new interfaces. The first level is the content level, which extracts the content from the archived web data. We develop ArcContent to expose the web archive content processed through various filters. The second level is the metadata level; we extract the metadata from the archived web data and make it available to users. We implement two services: ArcLink for the temporal web graph and ArcThumb for optimizing thumbnail creation in web archives. The third level is the URI level, which uses the URI's HTTP redirection status to enhance the user query. Finally, the highest level in the web archiving service framework pyramid is the archive level. At this level, we define the web archive by the characteristics of its corpus and build Web Archive Profiles. The profiles are used by the Memento Aggregator for query optimization.
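The four-level pyramid described in the abstract can be summarized as a simple mapping from level to the services proposed at that level. The service names come from the abstract itself; the dictionary structure below is only an illustration, not part of ArcSys.

```python
# The ArcSys service pyramid, from lowest level (content) to highest
# (archive). Service names are taken from the abstract; this mapping
# is an illustrative sketch, not ArcSys code.
ARCSYS_LEVELS = {
    "content":  ["ArcContent"],           # extract/filter archived content
    "metadata": ["ArcLink", "ArcThumb"],  # temporal web graph, thumbnails
    "uri":      ["redirection-aware query enhancement"],
    "archive":  ["Web Archive Profiles"], # used by the Memento Aggregator
}

metadata_services = ARCSYS_LEVELS["metadata"]
```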

    ArchiveSpark: Efficient Web Archive Access, Extraction and Derivation

    Full text link
    Web archives are a valuable resource for researchers of various disciplines. However, to use them as a scholarly source, researchers require a tool that provides efficient access to Web archive data for extraction and derivation of smaller datasets. Besides efficient access, we identify five other objectives based on practical researcher needs, such as ease of use, extensibility, and reusability. Towards these objectives we propose ArchiveSpark, a framework for efficient, distributed Web archive processing that builds a research corpus by working on existing and standardized data formats commonly held by Web archiving institutions. Performance optimizations in ArchiveSpark, facilitated by the use of a widely available metadata index, result in significant speed-ups of data processing. Our benchmarks show that ArchiveSpark is faster than alternative approaches without depending on any additional data stores, while improving usability by seamlessly integrating queries and derivations with external tools. Comment: JCDL 2016, Newark, NJ, US
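The core optimization the abstract describes, consulting a lightweight metadata index (such as a CDX index) to decide which records are worth extracting before touching the much larger WARC payloads, can be sketched in a few lines. This is not ArchiveSpark's actual API (which is Scala/Spark-based); the CDX field layout and helper names here are simplifying assumptions for illustration.

```python
# Sketch: filter a CDX metadata index first, so only matching records
# are later read from the underlying WARC files. The 8-field CDX
# layout (urlkey, timestamp, URL, MIME type, status, digest, offset,
# filename) is a simplified assumption; real CDX variants differ.

def parse_cdx_line(line):
    urlkey, ts, url, mime, status, digest, offset, filename = line.split()
    return {"urlkey": urlkey, "timestamp": ts, "url": url,
            "mime": mime, "status": status, "digest": digest,
            "offset": int(offset), "filename": filename}

def select_records(cdx_lines, mime="text/html", status="200"):
    """Return only the index records worth extracting from the WARCs."""
    return [r for r in map(parse_cdx_line, cdx_lines)
            if r["mime"] == mime and r["status"] == status]

cdx = [
    "org,example)/ 20160101000000 http://example.org/ text/html 200 AB 0 a.warc.gz",
    "org,example)/i 20160101000001 http://example.org/i image/png 200 CD 512 a.warc.gz",
    "org,example)/ 20160102000000 http://example.org/ text/html 404 EF 1024 b.warc.gz",
]
selected = select_records(cdx)
# Only the first record passes both the MIME-type and status filters.
```

Because the index holds offsets and filenames, the expensive payload reads can then be limited to exactly the selected records.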

    Profiling Web Archive Coverage for Top-Level Domain and Content Language

    Get PDF
    The Memento aggregator currently polls every known public web archive when serving a request for an archived web page, even though some web archives focus on only specific domains and ignore the others. Similar to query routing in distributed search, we investigate the impact on aggregated Memento TimeMaps (lists of when and where a web page was archived) of only sending queries to archives likely to hold the archived page. We profile twelve public web archives using data from a variety of sources (the web, archives' access logs, and full-text queries to archives) and discover that only sending queries to the top three web archives (i.e., a 75% reduction in the number of queries) for any request produces the full TimeMaps in 84% of the cases. Comment: Appeared in TPDL 201
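The TimeMaps being aggregated here are lists of mementos in the link format defined by the Memento protocol (RFC 7089). A minimal parser shows the kind of response the routing decision operates on; the sample TimeMap and archive URLs below are fabricated for illustration.

```python
import re

# Minimal parser for a Memento TimeMap in link format (RFC 7089).
# The sample data is fabricated; real TimeMaps are served by
# Memento-compliant archives and by the aggregator.
SAMPLE_TIMEMAP = (
    '<http://example.org/>; rel="original",\n'
    '<http://archive.example/web/20100101000000/http://example.org/>; '
    'rel="memento"; datetime="Fri, 01 Jan 2010 00:00:00 GMT",\n'
    '<http://archive.example/web/20150601000000/http://example.org/>; '
    'rel="memento"; datetime="Mon, 01 Jun 2015 00:00:00 GMT"'
)

def parse_timemap(text):
    """Extract (uri, datetime) pairs for each memento in a TimeMap."""
    mementos = []
    for entry in text.split(",\n"):
        m = re.match(r'<([^>]+)>;.*rel="memento".*datetime="([^"]+)"', entry)
        if m:
            mementos.append((m.group(1), m.group(2)))
    return mementos

mementos = parse_timemap(SAMPLE_TIMEMAP)
# Two mementos: one capture from 2010 and one from 2015.
```

Producing the "full" TimeMap for a URI means recovering every such memento entry; the routing question is how few archives can be queried while still doing so.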

    How Much of the Web Is Archived?

    Full text link
    Although the Internet Archive's Wayback Machine is the largest and most well-known web archive, a number of public web archives have emerged in the last several years. With varying resources, audiences, and collection development policies, these archives have varying levels of overlap with each other. While individual archives can be measured in terms of number of URIs, number of copies per URI, and intersection with other archives, to date there has been no answer to the question "How much of the Web is archived?" We study the question by approximating the Web using sample URIs from DMOZ, Delicious, Bitly, and search engine indexes, and counting the number of copies of the sample URIs that exist in various public web archives. Each sample set provides its own bias. The results from our sample sets indicate that 35%-90% of the Web has at least one archived copy, 17%-49% has between 2-5 copies, 1%-8% has 6-10 copies, and 8%-63% has more than 10 copies in public web archives. The number of URI copies varies as a function of time, but no more than 31.3% of URIs are archived more than once per month. Comment: This is the long version of the short paper by the same title published at JCDL'11. 10 pages, 5 figures, 7 tables. Version 2 includes minor typographical correction
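The coverage figures above come from bucketing each sample URI by how many archived copies it has. The bucketing itself is straightforward, as the sketch below shows; the copy counts in the example are fabricated, whereas the real samples came from DMOZ, Delicious, Bitly, and search-engine indexes.

```python
# Sketch of the coverage bucketing: given the number of archived
# copies found for each sampled URI, report what fraction of the
# sample falls in each band used in the study. The counts below are
# fabricated example data.

def coverage_bands(copy_counts):
    n = len(copy_counts)
    bands = {"0": 0, "1": 0, "2-5": 0, "6-10": 0, ">10": 0}
    for c in copy_counts:
        if c == 0:
            bands["0"] += 1
        elif c == 1:
            bands["1"] += 1
        elif c <= 5:
            bands["2-5"] += 1
        elif c <= 10:
            bands["6-10"] += 1
        else:
            bands[">10"] += 1
    return {band: count / n for band, count in bands.items()}

bands = coverage_bands([0, 0, 1, 3, 3, 7, 12, 12, 12, 40])
# 20% of this toy sample is unarchived; 40% has more than 10 copies.
```

A URI "has at least one archived copy" when it falls in any band other than "0", which is how the 35%-90% figure is read off per sample set.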

    Profiling Web Archives

    Get PDF
    PDF of a PowerPoint presentation from the 2014 International Internet Preservation Consortium (IIPC) General Assembly, Paris, France, May 21, 2014. Also available on Slideshare.

    Tools for Managing the Past Web

    Get PDF
    PDF of a PowerPoint presentation from the Archive-It Partners Meeting in Montgomery, Alabama, November 18, 2014. Also available on Slideshare.

    Who Will Archive the Archives? Thoughts About the Future of Web Archiving

    Get PDF
    PDF of a PowerPoint presentation from the Wolfram Data Summit 2013 in Washington D.C., September 5-6, 2013. Also available on Slideshare.

    A next-generation liquid xenon observatory for dark matter and neutrino physics

    Get PDF
    The nature of dark matter and properties of neutrinos are among the most pressing issues in contemporary particle physics. The dual-phase xenon time-projection chamber is the leading technology to cover the available parameter space for weakly interacting massive particles, while featuring extensive sensitivity to many alternative dark matter candidates. These detectors can also study neutrinos through neutrinoless double-beta decay and through a variety of astrophysical sources. A next-generation xenon-based detector will therefore be a true multi-purpose observatory to significantly advance particle physics, nuclear physics, astrophysics, solar physics, and cosmology. This review article presents the science cases for such a detector.
